Fine-tuning protein embeddings for functional similarity evaluation

Abstract
Motivation: Proteins with unknown function are frequently compared to better characterized relatives, either using sequence similarity, or recently through similarity in a learned embedding space. Through comparison, protein sequence embeddings allow for interpretable and accurate annotation of proteins, as well as for downstream tasks such as clustering for unsupervised discovery of protein families. However, it is unclear whether embeddings can be deliberately designed to improve their use in these downstream tasks.
Results: We find that for functional annotation of proteins, as represented by Gene Ontology (GO) terms, direct fine-tuning of language models on a simple classification loss has an immediate positive impact on protein embedding quality. Fine-tuned embeddings show stronger performance as representations for K-nearest neighbor classifiers, reaching stronger performance for GO annotation than even directly comparable fine-tuned classifiers, while maintaining interpretability through protein similarity comparisons. They also maintain their quality in related tasks, such as rediscovering protein families with clustering.
Availability and implementation: github.com/mofradlab/go_metric

Materials. Machine learning training was executed on an NVIDIA RTX A6000 GPU, using the PyTorch and PyTorch Lightning frameworks.
Experiments were run through Python scripts on lab servers. All code is included in the github.com/amdson/go_metric repository.

Methods.
1. F-max. The classical F1 score is defined for discrete classification predictions on discrete labels, while model outputs are continuous logit values, so before evaluation GO predictions are thresholded by an arbitrary value α. Logit values above α are set as positive predictions, while those below are set as negative. For a given α, define precision pr(α) as the number of true positive predictions divided by the number of positive predictions, and recall rc(α) as the number of true positive predictions divided by the number of positive annotations, with summations over all class labels. The F1 score at α is the harmonic mean of pr(α) and rc(α), and F-max is its maximum over all thresholds, $F_{\max} = \max_{\alpha} \frac{2\, pr(\alpha)\, rc(\alpha)}{pr(\alpha) + rc(\alpha)}$.
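The thresholding and maximum-over-thresholds procedure described above can be sketched in a few lines of NumPy. This is a minimal illustration, not the code from the repository: the function name `f_max`, the threshold grid, and the toy arrays are assumptions, and precision and recall are pooled over all proteins and class labels as the text describes.

```python
import numpy as np

def f_max(logits, labels, thresholds):
    """Return (best F1, best threshold) over a grid of thresholds.

    logits: (n_proteins, n_terms) float array of model outputs.
    labels: (n_proteins, n_terms) array of 0/1 GO annotations.
    Counts are summed over all proteins and class labels (an assumption
    matching the pooled definition in the text).
    """
    logits = np.asarray(logits, dtype=float)
    labels = np.asarray(labels, dtype=bool)
    n_pos = labels.sum()
    best_f, best_alpha = 0.0, None
    for alpha in thresholds:
        pred = logits > alpha            # positive predictions at this threshold
        tp = np.sum(pred & labels)       # true positives
        n_pred = pred.sum()
        if tp == 0 or n_pred == 0 or n_pos == 0:
            continue                     # F1 is zero or undefined here
        precision = tp / n_pred
        recall = tp / n_pos
        f1 = 2 * precision * recall / (precision + recall)
        if f1 > best_f:
            best_f, best_alpha = float(f1), float(alpha)
    return best_f, best_alpha

# Toy data: 2 proteins, 3 GO terms (hypothetical values).
toy_logits = [[2.0, -1.0, 0.5], [-0.5, 1.5, -2.0]]
toy_labels = [[1, 0, 1], [0, 1, 0]]
print(f_max(toy_logits, toy_labels, thresholds=[-3.0, 0.0, 1.0]))
```

On the toy data, the threshold 0.0 separates the annotated terms perfectly, so the sketch reports an F-max of 1.0 at that threshold.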
2. S-min. Information content is a quantitative measure of the descriptive value of a GO term, defined using the GO graph structure (2). It associates an importance weighting ic(g) to each GO term g ∈ Gn, representing the amount of information carried by a positive or negative annotation to the term.
With information content, define remaining uncertainty ru(α) and missing information mi(α) at each threshold α.
The final S-min score is the minimum over all thresholds α of the Euclidean norm of these two quantities, $S_{\min} = \min_{\alpha} \sqrt{ru(\alpha)^2 + mi(\alpha)^2}$.
3. F-Score. The F-Score metric is defined to approximate the true F1 score with perfect knowledge of datapoint labels (3). It is based on recall rc(α) and the probability of a positive prediction. As with F-max, we take the maximum over all thresholds α for our results.
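For the S-min metric described above, the following NumPy sketch shows one way the computation could look. The definitions of the two quantities are not spelled out in this text, so the sketch assumes the conventions common in CAFA-style evaluation: remaining uncertainty sums ic(g) over annotated terms that are not predicted, missing information (often called misinformation) sums ic(g) over predicted terms that are not annotated, and both are averaged over proteins. The function name `s_min` and the toy values are hypothetical.

```python
import numpy as np

def s_min(logits, labels, ic, thresholds):
    """Minimum over thresholds of sqrt(ru^2 + mi^2).

    logits: (n_proteins, n_terms) float array of model outputs.
    labels: (n_proteins, n_terms) array of 0/1 GO annotations.
    ic:     (n_terms,) information content of each GO term.
    The ru/mi definitions and per-protein averaging are assumptions.
    """
    logits = np.asarray(logits, dtype=float)
    labels = np.asarray(labels, dtype=bool)
    ic = np.asarray(ic, dtype=float)
    n = labels.shape[0]
    best = np.inf
    for alpha in thresholds:
        pred = logits > alpha
        ru = np.sum(ic * (labels & ~pred)) / n   # annotated but not predicted
        mi = np.sum(ic * (pred & ~labels)) / n   # predicted but not annotated
        best = min(best, float(np.sqrt(ru ** 2 + mi ** 2)))
    return best

# Toy data: 2 proteins, 3 GO terms with hypothetical ic values.
smin_logits = [[2.0, -1.0, 0.5], [-0.5, 1.5, -2.0]]
smin_labels = [[1, 0, 1], [0, 1, 0]]
smin_ic = [1.0, 2.0, 3.0]
print(s_min(smin_logits, smin_labels, smin_ic, thresholds=[-3.0, 0.0, 1.0]))
```

Lower is better for this metric: a threshold that reproduces the annotations exactly gives both quantities a value of zero.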
Baseline Deep Convolutional Model. The convolutional model is a PyTorch implementation of the multi-filter model from the original DeepGOPlus paper (4). The final model is built out of three main components, and takes proteins represented as residue sequences of any length as input.
Before processing, a one-hot encoder converts the 22 main protein residue types into binary vector representations, so that protein sequences are input as (L x 22) binary matrices.
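As an illustration of the one-hot encoding step, the sketch below maps a residue string to an (L x 22) binary matrix. The text does not list the 22 residue types, so the alphabet used here (the 20 standard amino acids plus U and O) is an assumption, as are the names `ALPHABET` and `one_hot`. Residues outside the alphabet are left as all-zero rows, which is also a choice made only for this example.

```python
import numpy as np

# Assumed 22-letter alphabet: 20 standard residues plus U and O.
ALPHABET = "ACDEFGHIKLMNPQRSTVWYUO"
INDEX = {aa: i for i, aa in enumerate(ALPHABET)}

def one_hot(seq):
    """Encode a residue string as an (L x 22) binary matrix."""
    mat = np.zeros((len(seq), len(ALPHABET)), dtype=np.uint8)
    for pos, aa in enumerate(seq):
        col = INDEX.get(aa)
        if col is not None:          # unknown residues stay all-zero
            mat[pos, col] = 1
    return mat

encoded = one_hot("MKV")
print(encoded.shape)
```

Each row contains at most a single 1, so the matrix has one row per residue regardless of sequence length.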
Input sequence matrices are first processed by a series of 1D convolutional operators, with kernels ranging from 3 to 129 in size and 800 filters for each kernel. Convolutional operators are initialized with weights from a Xavier distribution and zero bias. Each convolutional operator independently produces an (L x 800) feature map of the input, which is reduced by max pooling to an 800-dim vector with size independent of sequence length. Vector outputs from each convolutional operator are concatenated to produce a 10,000-dim representation.
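The multi-kernel convolution and pooling stage described above can be sketched in PyTorch as follows. This is a scaled-down illustration rather than the model from the repository: the class name `MultiKernelEncoder`, the three kernel sizes, and the 8 filters per kernel are assumptions chosen to keep the example small, and `xavier_uniform_` is assumed because the text does not say whether the Xavier initialization is uniform or normal. Padding of `k // 2` with odd kernel sizes keeps each feature map at length L.

```python
import torch
import torch.nn as nn

class MultiKernelEncoder(nn.Module):
    """Parallel 1D convolutions, each max-pooled over sequence length."""

    def __init__(self, n_residues=22, kernel_sizes=(3, 9, 17), n_filters=8):
        super().__init__()
        # One convolution per (odd) kernel size; sizes here are illustrative.
        self.convs = nn.ModuleList([
            nn.Conv1d(n_residues, n_filters, kernel_size=k, padding=k // 2)
            for k in kernel_sizes
        ])
        for conv in self.convs:
            nn.init.xavier_uniform_(conv.weight)  # assumed Xavier variant
            nn.init.zeros_(conv.bias)

    def forward(self, x):
        # x: (batch, L, n_residues); Conv1d expects (batch, channels, L).
        x = x.transpose(1, 2)
        # Max over the length axis gives (batch, n_filters) per kernel size.
        pooled = [conv(x).max(dim=2).values for conv in self.convs]
        return torch.cat(pooled, dim=1)

encoder = MultiKernelEncoder()
short = encoder(torch.zeros(2, 50, 22))    # two sequences of length 50
long = encoder(torch.zeros(1, 200, 22))    # one sequence of length 200
print(short.shape, long.shape)
```

Both calls return vectors of width 24 (3 kernel sizes times 8 filters), showing that the output size depends only on the number of kernels and filters, not on L.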

Fig. S3. Scatterplots of precision or recall for each of the 865 GO classes included in the dataset. The X-axis gives precision or recall for the baseline BLAST model, while the Y-axis shows the corresponding value on the same GO class for a fine-tuned or embedding model. Points below the red line have higher precision or recall when annotated by BLAST.

Fig. S4. (A) Performance of selected models on the 50% subset of GO terms used for training. (B) Performance of query-based models on the 50% subset of GO terms excluded from training. (C) Performance of all models on sequences with no significant BLAST matches in the training data.